(54) Method for analyzing the performance of application programs



(57) Methods, systems, and articles of manufacture
consistent with the present invention collect and display
performance data associated with executed programs. A
system consistent with an implementation of the present
invention collects performance analysis information from
various hardware and software components of an instrumented
program, and displays the performance data in a
multi-dimensional format.

[0001] The present invention relates generally to
performance analysis and more specifically to methods
for providing a multidimensional view of performance
data associated with an application program.

Background 

[0002] Multi-threading is the partitioning of an
application program into logically independent "threads" of
control that can execute in parallel. Each thread
includes a sequence of instructions and data used by
the instructions to carry out a particular program task,
such as a computation or input/output function. When
employing a data processing system with multiple
processors, i.e., a multiprocessor computer system, each
processor executes one or more threads, depending
upon the number of processors, to achieve
multi-processing of the program.

[0003] A program can be multi-threaded and still
not achieve multi-processing if a single processor is
used to execute all threads. While a single processor
can execute instructions of only one thread at a time,
the processor can execute multiple threads in parallel
by, for example, executing instructions corresponding to
one thread until reaching a selected instruction,
suspending execution of that thread, and executing
instructions corresponding to another thread, until all threads
have completed. In this scheme, as long as the processor
has started executing instructions for more than one
thread during a given time interval, all executing threads
are said to be "running" during that time interval.
[0004] Multiprocessor computer systems are typically
used for executing application programs intended
to address complex computational problems in which
different aspects of a problem can be solved using
portions of a program executing in parallel on different
processors. A goal associated with using such systems to
execute programs is to achieve a high level of
performance, in particular, a level of performance that reduces
the waste of computing resources. Computer
resources may be wasted, for example, if processors
are idle (i.e., not executing a program instruction) for
any length of time. Such a wait cycle may be the result
of one processor executing an instruction that requires
the result of a set of instructions being executed by
another processor.

[0005] It is thus necessary to analyze performance
of programs executing on such data processing systems
to determine whether optimal performance is
being achieved. If not, areas for improvement should be
identified.

[0006] Performance analysis in this regard generally
requires gathering information in three areas. The
first considers the processor's state at a given time
during program execution. A processor's state refers to the
portion of a program (for example, a set of instructions
such as a subprogram, loop, or other code block) that
the processor is executing during a particular time
interval. The second considers how much time a processor
spends in transition from one state to another. The third
considers how close a processor is to executing at its
peak performance level. These three areas do not provide a
complete analysis, however. They fail to address a
fourth component of performance analysis, namely
precisely what a processor did during a particular state
(e.g., computation, input data, output data, etc.).

[0007] When considering what a processor did
while in a particular state, a performance analysis tool
can determine the effect of operations within a state on
the performance level. Once these factors are identified,
it is possible to synchronize operations that have a
significant impact on performance with operations that
have a less significant impact, and achieve a better
overall performance level. For example, one thread
may perform an operation that uses significant
resources while another thread, scheduled to perform an
operation in parallel with the first thread, sits
idle. It may be desirable to cause the second thread to
perform a different operation that does not require the
first thread to complete its operation, thus eliminating
the idle period, so that the operations performed
by both threads are better synchronized.

[0008] When a performance analysis tool reports a
problem occurring in a particular state, but fails to relate
the problem to other events occurring in an application
(for example, operations of another state), the
information reported is relatively meaningless. To be useful, a
performance analysis tool must assist a developer in
determining how performance information relates to a
program's execution. Therefore, allowing a developer to
determine the context in which a performance problem

[0009] The process of gathering this information for
performance analysis is referred to as "instrumentation."
Instrumentation generally requires adding instructions
to a program under examination so that when the
program is executed the instructions generate data from
which the performance information can be derived.
[0010] Current performance analysis tools gather
data in one of two ways: subprogram level instrumentation
and bucket level instrumentation. A subprogram
level instrumentation performance analysis tool
tracks the number of subprogram calls by instrumenting
each subprogram with a set of instructions that generate
data reflecting calls to the subprogram. It does not
allow a developer to collect data associated
with the operations performed by each subprogram or a
specified portion of the subprogram, for example, by
specifying data collection beginning and ending points
within a subprogram.

[0011] A bucket level instrumentation performance
analysis tool divides the executable code into evenly
spaced groups, or buckets. Performance data tracks the
number of times a program counter was in a particular
bucket at the conclusion of a specified time interval.
This method of gathering performance data essentially
takes a snapshot of the program counter at the specified
time interval. This method fails to provide comprehensive
performance information because it only
collects data related to a particular bucket during the
specified time interval.
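
By way of illustration only, the bucket level approach described above can be sketched as a histogram over sampled program counter values. The function name, the address range, and the sample values below are hypothetical and are not taken from any particular tool; the sketch assumes samples have already been captured at the specified time interval.

```python
from collections import Counter


def bucket_histogram(pc_samples, code_start, code_end, num_buckets):
    """Count sampled program-counter values per evenly spaced bucket.

    Samples outside [code_start, code_end) are ignored.
    """
    span = code_end - code_start
    counts = Counter()
    for pc in pc_samples:
        if code_start <= pc < code_end:
            # Integer arithmetic maps the address to a bucket index
            # in the range 0 .. num_buckets - 1.
            counts[(pc - code_start) * num_buckets // span] += 1
    return [counts[i] for i in range(num_buckets)]


# Hypothetical samples, one taken at the end of each time interval.
samples = [0x1000, 0x1004, 0x10FF, 0x1100, 0x13FF, 0x2000]
histogram = bucket_histogram(samples, 0x1000, 0x1400, 4)
```

The histogram shows only where the program counter happened to be at each sample, which is the limitation noted above: nothing is recorded about what the program was doing within a bucket between samples.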

[0012] The current performance analysis methods
fail to provide customized collection or output of
performance data. Generally, performance tools only
collect a pre-specified set of data to display to a developer.

Summary of the Invention 

[0013] Methods, systems, and articles of manufacture
consistent with the present invention overcome the
shortcomings of the prior art by facilitating performance
analysis of multi-threaded programs executing in a data
processing system. Such methods, systems, and articles
of manufacture analyze performance of threads
executing in a data processing system by receiving data
reflecting a state of each thread executing during a
measurement period, and displaying a performance
level corresponding to the state of each thread during
the measurement period.


Brief Description of the Drawings 

[0014] The accompanying drawings, which are
incorporated in and constitute a part of this specification,
illustrate an implementation of the invention and,
together with the description, serve to explain the
advantages and principles of the invention. In the
drawings,

FIG. 1 depicts a data processing system suitable for
implementing a performance analysis system
consistent with the present invention;
FIG. 2 depicts a block diagram of a performance
analysis system operating in accordance with
methods, systems, and articles of manufacture
consistent with the present invention;
FIG. 3 depicts a flow chart illustrating operations
performed by a performance analysis system
consistent with an implementation of the present
invention; and
FIG. 4 depicts a multi-dimensional display of the
performance data associated with an application
program that has been instrumented in accordance
with an implementation of the present invention.


Detailed Description 

[0015] Reference will now be made in detail to an
implementation consistent with the present invention as
illustrated in the accompanying drawings. Wherever
possible, the same reference numbers will be used
throughout the drawings and the following description to
refer to the same or like parts.

Overview 

[0016] Methods, systems, and articles of manufacture
consistent with the present invention utilize
performance data collected during execution of an
application program to illustrate graphically for the
developer performance data associated with the
program. The program is instrumented to generate the
performance data during execution. Each program thread
performs one or more operations, each operation
reflecting a different state of the thread. The
performance data may reflect an overall performance for each
thread as well as a performance level for each state
within a thread during execution. The developer can
specify the type and extent of performance data to be
collected. By providing a graphical display of the
performance of all threads together, the developer can see
where to make any appropriate adjustments to improve
overall performance by better synchronizing operations
among the threads.

[0017] A performance analysis database access
language is used to instrument the program in a manner
consistent with the principles of the present invention.
Instrumentation can be done automatically using known
techniques that add instructions to programs at specific
locations within the programs, or manually by a
developer. The instructions may specify collection of
performance data from multiple system components; for
example, performance data may be collected from both
hardware and the operating system.
[0018] A four-dimensional display of performance 
data includes information on threads, times, states, and 
performance level. A performance analyzer also evalu- 
ates quantitative expressions corresponding to perform- 
ance metrics specified by a developer, and displays the 
computed value. 

Performance Analysis System 

[0019] FIG. 1 depicts an exemplary data processing
system 100 suitable for practicing methods and systems
consistent with the present invention. Data processing
system 100 includes a computer system 105 connected
to a network 190, such as a Local Area Network, Wide
Area Network, or the Internet.
[0020] Computer system 105 contains a main
memory 130, a secondary storage device 140, a
processor 150, an input device 170, and a video display 160.
These internal components exchange information with
one another via a system bus 165. The components are
standard in most computer systems suitable for use with
practicing methods and configuring systems consistent
with the present invention. One such computer system
is the SPARCstation from Sun Microsystems, Inc.

[0021] Although computer system 105 contains a
single processor, it will be apparent to those skilled in
the art that methods consistent with the present
invention operate equally well in a multi-processor
environment.

[0022] Memory 130 includes a program 110 and a
performance analyzer 115. Program 110 is a
multi-threaded program. For purposes of facilitating
performance analysis of program 110 in a manner consistent
with the principles of the present invention, the program
is instrumented with appropriate instructions of the
developer's choosing to generate certain performance
data.

[0023] Performance analyzer 115 comprises
two components. The first component 115a is a library
of functions to be performed in a manner specified by
the instrumented program. The second component
115b is a developer interface that is used for two
functions: (1) automatically instrumenting a program; and
(2) viewing performance information collected when an
instrumented program is executed.
[0024] As explained, instrumentation can be done
automatically with the use of performance analyzer
interface 115b. According to this approach, the
developer simply specifies for the analyzer the type of
performance data to be collected and the analyzer adds the
appropriate commands from the performance analysis
database access language to the program in the
appropriate places. Techniques for automatic instrumentation
in this manner are familiar to those skilled in the art.
Alternatively, the developer may manually insert
commands from the performance analysis database access
language in the appropriate places in the program so
that during execution specific performance data is
recorded. The performance data generated during
execution of program 110 is recorded in memory, for
example, main memory 130.

[0025] Performance analyzer interface 115b
permits developers to view performance information
corresponding to the performance data recorded when
program 110 is executed. As explained below, the
developer may interact with the analyzer to alter the
view to display performance information in various
configurations to observe different aspects of the program's
performance without having to repeatedly execute the
program to collect information for each view, provided
the program was properly instrumented at the outset.
Each view may show (i) a complete measurement cycle
for one or more threads; (ii) when each thread enters
and leaves each state; and (iii) selected performance
criteria corresponding to each state.
[0026] Although not shown in FIG. 1, like all
computer systems, system 105 has an operating system that
controls its operations, including the execution of
program 110 by processor 150. Also, although aspects of
one implementation consistent with the principles of the
present invention are described herein with
performance analyzer stored in main memory 130, one skilled
in the art will appreciate that all or part of systems and
methods consistent with the present invention may be
stored on or read from other computer-readable media,
such as secondary storage devices, like hard disks,
floppy disks, and CD-ROM; a carrier wave received
from the Internet; or other forms of ROM or RAM.
Finally, although specific components of data
processing system 100 have been described, one skilled in the
art will appreciate that a data processing system
suitable for use with the exemplary embodiment may contain
additional or different components.
[0027] FIG. 2 depicts a block diagram of a
performance analysis system consistent with the present
invention. As shown, program 210 consists of multiple
threads 212, 214, 216, and 218. Processor 220
executes threads 212, 214, 216, and 218 in parallel.
Memory 240 represents a shared memory that may be
accessed by all executing threads. A protocol for
coordinating access to a shared memory is described in U.S.
Patent Application Serial No. ______, by
Shaun Dennie, entitled "Protocol for Coordinating the
Distribution of Shared Memory" (Attorney Docket No.
06502-0207-00000), which is incorporated herein by
reference. Although a single processor 220 is shown,
multiple processors may be used to execute threads
212, 214, 216, and 218.

[0028] To facilitate parallel execution of multiple
threads 212, 214, 216, and 218, memory 240 is
partitioned into segments designated for
operations associated with each thread, and the fields
of each segment are initialized. For example, memory
segment 245 comprises enter and exit state identifiers,
developer specified information, and thread
identification information. An enter state identifier stores data
corresponding to when, during execution, a thread
enters a particular state. Similarly, an exit state identifier
stores data corresponding to when, during execution of
an application program, a thread leaves a particular
state. Developer specified data represents the
performance analysis data collected.
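
By way of illustration only, the contents of a per-thread memory segment and the manual instrumentation that fills it might be sketched as follows. The class names `StateRecord` and `PerformanceLog`, the `state` context manager, and the use of an ordinary dictionary in place of partitioned shared memory are all assumptions made for this sketch; they do not represent the performance analysis database access language referred to in this description.

```python
import threading
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class StateRecord:
    thread_id: int                      # thread identification information
    state: str                          # name of the state being measured
    enter_time: float                   # enter state identifier
    exit_time: Optional[float] = None   # exit state identifier
    developer_data: Dict[str, float] = field(default_factory=dict)


class PerformanceLog:
    """Keeps one list of records (a 'segment') per thread."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter):
        self._clock = clock
        self._lock = threading.Lock()
        self.segments: Dict[int, List[StateRecord]] = {}

    @contextmanager
    def state(self, name: str):
        """Record when the calling thread enters and leaves a state."""
        tid = threading.get_ident()
        record = StateRecord(tid, name, self._clock())
        try:
            # The caller stores developer specified data in this dict.
            yield record.developer_data
        finally:
            record.exit_time = self._clock()
            with self._lock:
                self.segments.setdefault(tid, []).append(record)


if __name__ == "__main__":
    log = PerformanceLog()
    with log.state("compute") as data:
        data["num_ops"] = sum(1 for _ in range(1000))
    print(log.segments[threading.get_ident()])
```

Passing the clock as a parameter is a design choice of the sketch that makes the recorded enter and exit values reproducible when a fixed sequence of timestamps is supplied.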

[0029] A reserved area of memory 250 is used to
perform administrative memory management functions,
such as coordinating the distribution of shared memory
to competing threads. The reserved area of memory
250 is also used for assigning identification information
to threads using memory.

[0030] The flow chart of FIG. 3 provides additional
details regarding the operation of a performance
analysis system consistent with an implementation of the
present invention. Instructions that generate
performance data are inserted into a program (step 305). The
instrumented program is executed and the performance
data are generated (steps 310 and 315). In response to
a request to view performance data, performance
analyzer accesses and displays the performance data (step
320).

[0031] Performance analyzer is capable of displaying
both the performance data and the related source
code and assembly code, i.e., machine instructions,
corresponding to the data. This allows a developer to
relate performance data to both the source code and
the assembly code that produced the data.
[0032] FIG. 4 shows a display 400 with two parts
labeled A and B, respectively. The first part, labeled A,
shows the performance characteristics of an application
program in four dimensions: threads, time, states, and
performance. Performance information for each thread
is displayed horizontally using a bar graph-type format.
Time is represented on the horizontal axis; performance
is represented on the vertical axis.
[0033] Two threads, thread 1 and thread 2 in display
400, were executing concurrently. As shown, the
threads began executing at different times. The
horizontal axis for thread 1 is labeled 402. Thread 1 began
executing at a point in time labeled V on the horizontal axis
402. The horizontal axis for thread 2 is labeled 404.
Thread 2 began executing at time V. Each thread
performed operations in multiple states, each state being
represented by a different pattern. Thread 2 was idle at
the beginning of the measuring period. One reason for
this idle period may be that thread 2 was waiting for
resources from thread 1. Based on this information, a
developer can allocate operations of a thread among
states such that performance will be improved, for
example, by not executing concurrent operations that
require use of the same system resources.
[0034] As shown, thread 1 entered state 410 at a
point in time "x" on the horizontal axis 402 and left state
410 at time "y", and entered state 420 at time "m" and
left state 420 at time "n". The horizontal distance
between points "x" and "y" is shorter than the horizontal
distance between points "m" and "n". Therefore, thread
1 operated in state 420 longer than it operated in state
410. The vertical height of the bars shows a level of
performance. The vertical height for state 410 is lower than
the vertical height for state 420, showing that states 410
and 420 operated at different levels of performance. The
change in vertical height as an executing thread
transitions from one state to another corresponds to changes
in performance level. This information may be used to
identify the effect of transitioning between consecutive
states on performance, and directs a developer to areas
of the program for making changes to increase
performance.
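
By way of illustration only, the comparison read from the display above (how long a thread spent in each state, and how the performance level changed between consecutive states) can be computed from recorded intervals. The `Interval` tuple, the function names, and all numeric values below are hypothetical; they are chosen only so that state 420 lasts longer and sits higher than state 410, as in the description of thread 1.

```python
from collections import namedtuple

# One bar of the display: a thread, a state, its enter/exit times,
# and the performance level shown as the bar's height.
Interval = namedtuple("Interval", "thread state enter exit level")


def time_in_state(intervals):
    """Total time each (thread, state) pair was active."""
    totals = {}
    for iv in intervals:
        key = (iv.thread, iv.state)
        totals[key] = totals.get(key, 0.0) + (iv.exit - iv.enter)
    return totals


def level_changes(intervals, thread):
    """Change in performance level between consecutive states of a thread."""
    own = sorted((iv for iv in intervals if iv.thread == thread),
                 key=lambda iv: iv.enter)
    return [(a.state, b.state, b.level - a.level)
            for a, b in zip(own, own[1:])]


# Hypothetical measurements for thread 1.
intervals = [
    Interval(thread=1, state=410, enter=0.0, exit=2.0, level=30.0),
    Interval(thread=1, state=420, enter=3.0, exit=8.0, level=55.0),
]
durations = time_in_state(intervals)   # state 420 lasts longer than 410
changes = level_changes(intervals, 1)  # level rises from 410 to 420
```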

[0035] The bottom half of the display, labeled B,
illustrates an expression evaluation feature of the
performance analyzer's interface. A developer specifies
computational expressions related to a performance
metric of a selected state(s). The performance analyzer
computes the value of an expression for the
performance data collected.

[0036] In the example shown, the developer has
selected state 440. The expression on the first line,
"NUM.OPS/(100000*TIME)", is an expression for
computing the number (in millions) of floating point
instructions per second (MFLOPS). The expression on the
second line, "2*CPU.MHZ", calculates a theoretical
peak level of performance for a specified state.
Performance analyzer may evaluate these two expressions in
conjunction to provide quantitative information about a
particular state. For example, by dividing MFLOPS by
the theoretical peak performance level for state 440,
performance analyzer calculates for the developer the
percentage of theoretical peak represented by each
operation in state 440.
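
By way of illustration only, the evaluation of the two expressions in conjunction might be sketched as follows. The function and parameter names are hypothetical, the sketch uses the conventional definition of MFLOPS (operations divided by one million times the elapsed seconds), and it assumes that the factor 2 in the peak expression means two floating point operations per clock cycle; the input values are invented.

```python
def mflops(num_ops, time_s):
    """Millions of floating point operations per second."""
    return num_ops / (1_000_000 * time_s)


def theoretical_peak_mflops(cpu_mhz, flops_per_cycle=2):
    """Peak rate, assuming a fixed number of operations per clock cycle."""
    return flops_per_cycle * cpu_mhz


def percent_of_peak(num_ops, time_s, cpu_mhz):
    """Measured rate as a percentage of the theoretical peak."""
    return 100.0 * mflops(num_ops, time_s) / theoretical_peak_mflops(cpu_mhz)


# Hypothetical state: 150 million operations in 1.5 s on a 200 MHz CPU.
rate = mflops(150_000_000, 1.5)                  # 100.0 MFLOPS
share = percent_of_peak(150_000_000, 1.5, 200)   # 25.0 percent of peak
```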

Conclusion 

[0037] Methods and systems consistent with the
present invention collect performance data from
hardware and software components of an application
program, allowing a developer to understand how
performance data relates to each thread of a program
and complementing a developer's ability to understand
and subsequently diagnose performance issues
occurring in a program.

[0038] Although the foregoing description refers
to a specific implementation,
those skilled in the art will know of various changes in
form and detail which may be made without departing
from the spirit and scope of the present invention as
defined in the appended claims and the full scope of
their equivalents.

Claims 

1. A method for analyzing performance of threads 
executing in a data processing system, the method 
comprising: 

receiving data reflecting a state of each thread 
executing during a measuring period; and 
displaying a performance level corresponding 
to the state of each thread during the measur- 
ing period. 

2. The method of claim 1 wherein receiving includes: 

designating the data to be collected during the 
measuring period. 

3. The method of claim 1 wherein displaying includes: 

accessing a data file created during program
execution.

4. A method for analyzing performance of threads
executing in a data processing system, the method
comprising:

inserting instructions in a program that generate
data reflecting a performance level while a
program is executing;
executing the program; and
displaying the performance data according to
the state of a thread operating in the program
during a measuring period.

5. A method for analyzing performance of threads
executing in a data processing system, the method
comprising:

receiving a program instrumented with instructions
that generate performance data;
executing the program and collecting the
performance data; and
displaying a performance level corresponding
to the state of each thread during a
measurement period.

6. A method for analyzing performance of threads
executing in a multi-processor computing system,
the method comprising:

designating a particular time interval of an
executing program to collect performance data;
and
displaying data corresponding to a
performance level generated during execution of the
program.

7. A system for collecting performance data
associated with threads executing in a multi-processor
environment including:

an expandable set of commands for designating
collection of performance data related to a
thread's execution in a particular state; and
a storage area for storing the performance data
collected.

8. A system for displaying performance data
associated with threads executing in a multi-processor
environment including:

a set of performance data generated from an 
instrumented program; and 
a display of a performance level of each thread 
corresponding to the performance data. 

9. A method for analyzing performance of an
application program executing in a data processing
system, wherein the application program is executed
using multiple threads and each executing thread
passes through multiple states, the method
comprising:

reading data corresponding to a predetermined
performance-related aspect of the application
program for more than one state of each thread
during a measuring period; and
displaying a performance level corresponding
to the predetermined performance-related
aspect of the application program and
reflecting changes in performance at every point in
time in each state.